arXiv:1506.04692vl [math.ST] 15Jun2015 


A new V-fold type procedure based on robust tests 


Lucien Birge*^, Nelo Magalhaes^^’^, and Pascal Massartl^’^ 

^LPMA, UPMC Universite Paris 06 
^Equipe Probabilites et Statistiques, Universite Paris-Snd 11 
^INRIA team Select 


Jnne 2015 


Abstract 

We define a general V-fold cross-validation type method based on robust tests, which 
is an extension of the hold-out defined by Birge [7, Section 9]. We give some theoretical 
results showing that, under some weak assumptions on the considered statistical proce¬ 
dures, our selected estimator satisfies an oracle type inequality. We also introduce a fast 
algorithm that implements our method. Moreover we show in our simulations that this 
V-fold performs generally well for estimating a density for different sample sizes, and can 
handle well-known problems, such as binwidth selection for histograms or bandwidth se¬ 
lection for kernels. We finally provide a comparison with other classical V-fold methods 
and study empirically the influence of the value of V on the risk. 


1 Introduction 

The purpose of this paper is to offer a new method to solve the following problem. Suppose 
we are given i.i.d. observations from an unknown distribution Pg to be estimated. This 
distribution is often assumed to have a density s with respect to some given measure p, 
hence our notation, but we shall also consider the case when Pg is not absolutely continuous 
with respect to p, keeping the same notation Pg for the true distribution, in which case the 
subscript s just indicates that Pg is the distribution of the observations. 

We also have at hand a family of statistical procedures or algorithms {Am, m G Al} 
that can be applied to the observations in order to derive estimators of Pg. How can we use 
our data in order to choose one potentially optimal algorithm in the family, provided that 
a criterion of quality for the estimators has been chosen? Let us now be somewhat more 
precise. 
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1.1 The problem of procedure choice 

We observe an n-sample X = {Xi,..., Xn} of random variables Xi with values in the mea¬ 
sured space {X,£) and we assume (temporarily) that the distribution Pg = s ■ fj, of the Xi 
admits a density s with respect to some given positive measure /r on X and that s belongs to 
some given subset S of Li(/i). The purpose here is to use the observations in order to design 
an estimator s = ^(X) of s. 

There is a huge amount of strategies for solving this estimation problem, depending on 
the additional assumptions one makes about s. We shall use the notion of statistical pro¬ 
cedure (procedure for short), also denoted statistical algorithm in what follows, in order to 
properly formalize these strategies. Following [1], we define a procedure or an algorithm as 
any measurable mapping A from ljfc>i fo S. Such a procedure associates to any random 
sample G an estimator s^ = .A(Yfc) G Ibi(/r) of s. A classical criterion from decision 
theory used to measure the quality of a procedure A based on an i.i.d. sample of size k when 
s obtains is its risk: [£(s, A(Yfc))], where i is some given loss funetion and E^ denotes the 

expectation when s obtains, i.e. when the distribution of Y^ is The smaller the risk, 

the better the procedure A. 

To define the risk of a procedure one can consider various loss functions. Some popular 
ones are derived from a eontrast funetion 7 (see [9, Definition 1]) which is a mapping from 
5 X Y to M such that s minimizes over S the function t 1 —)■ E^ ['y(t,X)]. The loss £ at t is 
then defined as 

£(s, t) = Es [ 7 (t,X) — 7 ( 5 ,X)] > 0 for all t G 5, (1) 

hence f(s, s) = 0. The L 2 -I 0 SS derives from the choice S = E 2 (/x) nLi(;u) and 'y{t, x) = ||t|p — 
2t{x), where ||t|| = [Jp^,t‘^dp] ' denotes the L 2 -norm. The Kullback-Leibler loss corresponds 
to the contrast function 'y{t,x) = — log(t(x)) with S being the set of all probability densities 
with respect to fi. 

In this paper, we consider the problem of proeedure seleetion. Let (Am)meM denote a 
collection of candidate statistical procedures. Our goal is to choose from the observations 
X one of these procedures, that is some m(X) G Af, in order to have the most accurate 
estimation of s. If we apply all these procedures to the sample X we get the corresponding 
collection of estimators {sm = Am(X), m G Ai}. Given a loss £, the best possible choice for 
m would be to select m* G Ai such that 

E,[£(s,?^*(X))] = inf E,[£(s,?^(X))]. 

Unfortunately, since s is unknown, all the risks Es[£(s,Sm)] are unknown as well and we 
cannot select the so-called oraele algorithm Am*- One can only hope to choose fh = fhfX.) in 
such a way that Eg [I'('S,s^^)] is close to Eg [£(s,Sm*)]- 

To make this presentation more explicit, let us mention some classical estimation problems 
that naturally fit into it: 

• Bandwidth seleetion (see [11, Chapter 11]). Let Y = M, /i be the Lebesgue measure, K : 
M — )■ M a given nonnegative function satisfying K{x) dx = 1 and % = {hm,m € Ai} 
be a finite or countable set of positive bandwidths. We define the kernel algorithm Am 
as the procedure that produces from any sample Y^. of size k a kernel density estimator 
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with bandwidth hm, which means that 


Am0^k){x) 



E ^ 

Yi&Yk 



for all X G M. 


The problem of choosing a best estimator among the family {sm, m € M} amounts to 
select a “best” bandwidth in T-L, that is one that minimizes the risk [^(s,s^m)] with 
respect to m. 

• Model selection (see [ 16 ]). We recall that a model S for s is any subset of S. It 
follows from (1) that minimizing, for t in S', the loss i{s,t) derived from the contrast 
function 7 amounts to minimizing t 1—?■ Eg ['y(t,X)] over S. Since s is unknown, this 
is impossible but if we replace E^ [7(t, X)] by its unbiased empirical version: 7n(t) = 
n~^ SiLi 7(T Ali) we can derive an estimator with values in S by minimizing 7n(t) over S 
instead. This procedure 7I5 is a minimum contrast algorithm that provides a minimum 
contrast estimator ss{X.) G argmin^g^ 7„(t) on S. Using for instance, the Kullback- 
Leibler contrast on a set S of densities leads to the so-called “maximum likelihood 
estimator” on S. 

If we have at hand some finite or countable collection of models {Sm}m&M a suitable 
contrast function 7 we may associate in this way to each model Sm a minimum contrast 
algorithm Am and the corresponding minimum contrast estimator Sm(X). The problem 
of “model selection” is to select from the data a “best model” (one with the minimal 
risk) in the family, leading to a “best” possible minimum contrast estimator. 


Instead of deriving the loss function t from a contrast function we may use for ^ the 
squared Hellinger distance provided that our estimators Sm are genuine probability densities. 
We recall that the Hellinger distance h and the Hellinger affinity p between two probabilities 
P and Q defined on X are given respectively by 


h{P,Q) = 


\/ZP- v^) 


21 V2 


and p(P, Q) 


^/dPdQ = l-h^{P,Q), (2) 


where dP and dQ denote the densities of P and Q with respect to any dominating measure 
(the result being independent of this choice). One advantage of this loss function lies in the 
fact that h is a distance on the set V of all probabilities on X and therefore does not require 
that Pg be absolutely continuous with respect to p, which is one of the reasons why we shall 
use it in the sequel. In this case we take for S a set of probability densities with respect to 
p and we set, for all t in 5 and Pt = t ■ p, i{s,t) = ffi{Ps,Pt) which we shall write ffi{s,t) 
for simplicity. We shall also write p{t,u) for p{Pt,Pu)- This loss then leads to the quadratic 
Hellinger risk. 


1.2 Cross-validation 

The biggest difficulty for selecting a procedure in a given family {Am, m G At} comes from 
the fact that we use the same data X to build the estimators Sm(X) and to evaluate their 
quality. It is indeed well-known that evaluating the statistical performance of a procedure 
with the same data that have been used for the construction of the corresponding estimator 
leads to an overoptimistic result. One solution to avoid this drawback is to save a fraction of 
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the initial sample to test the output of the procedures Am on it. This is the basic idea behind 
cross-validation (CV) which relies on data splitting. 

The simplest CV method is the hold-out (HO) which corresponds to a single split of 
the data. The set X is divided once and for all into two non-empty proper subsets X* and 
X’' = X \ X* to be called respectively the training and the validation sample. First, with the 
training sample X*, we construct a set {.Am(X*), m E Af} of preliminary estimators. Then, 
using the validation sample X*^, we choose a criterion in order to evaluate the quality of each 
procedure Am from the observation of A.m(X*). Finally, we select rh{X.'^) minimizing this 
criterion over Al. Depending on the author, the final estimator might be either A^(X*) (as 
in [11]) or A-(X) (as in [ 2 ]). All CV methods are deduced from the HO: instead of using 
one single partition of our sample, we use different partitions, compute the HO criterion for 
each one and finally define the CV criterion by averaging all the HO criteria. The goal, by 
considering several partitions instead of one, is to reduce the variability with the hope that 
the CV criterion will lead to a more accurate evaluation of the quality of each procedure. 

We shall focus here on V-fold cross-validation (VFCV) which corresponds to a particular 
set of data splits^. One divides the sample X into V > 2 disjointed and therefore independent 
subsamples Xj, j = 1 ,... ,V, of the same size p = n/V (assuming, for simplicity, that p is 
an integer) so that X = Ujli ^j- For each split j E { 1 ,..., V}, one uses X^ to build the 
family of “partial estimators” {smj = •^m(Xj), m E Al} and the corresponding validation 
sample Xj to define an evaluation criterion critj(m) = critj(m)(Xj) of the procedure A.m(X^) 
corresponding to the partition (Xj,Xp of the data. One finally selects a strategy mvF 
minimizing the averaged criterion: 

mvF £ argmincrit(m) with crit(m) = 

niGM 

There are as many V-fold procedures as there are different ways to define critj(m). If 
we work with a loss of the type ( 1 ), the best estimator in the family {sm,j, "m E Al} is the 
one minimizing the loss, i.e. the one minimizing [7(sm,j, A)] (with X being independent 
of Xj). A natural idea for evaluating this quantity that we cannot compute since we do not 
know s is to estimate it by its empirical version based on the independent sample Xj of size 
p, which leads to the criterion 

critj(m) = - ^ -f{sm,j,Xi). 


y 


-^critj(m). 


i=i 


In this classical context, we naturally select the statistical procedure with the lowest estimated 
average loss crit(m). The choice 'y{t,x) = — log(t(x)) leads to the Kullback-Leibler V-fold 
(KLVF) whereas 7(t, x) = ||t|p — 2 t(x) provides the Least-Squares V-fold (LSVF). The chosen 
estimators will be respectively denoted tuKLVF and mnsYF and the relevant classical criterion 
will be denoted critvFCV in what follows. 


1.3 An alternative criterion 

When the chosen loss function that we use is the squared Hellinger distance, an alternative 
empirical criterion to evaluate the quality of an estimator has been proposed by Birge [ 5 ] 

^The concerned reader should have a look at the survey of Arlot and Celisse [1] to get a complete overview 
of other CV methods. 
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following ideas of Le Cam [12,13] to process estimator selection. An alternative method was 
later introduced by Baraud [3]. An HO strategy based on this criterion was first proposed 
by Birge in [7], this latter procedure being recently implemented in [14]. The idea behind 
the construction is as follows. Suppose we have at hand a set T of densities with respect 
to /r and, for each pair (t,u), t 7 ^ u, of points of T, a test between t and u = t 
meaning accepting t). Given a sample X we may perform all the tests and consider 

the criterion T){t) defined on T by 

V{t)= sup (3) 

u^'T, u^t 

It immediately follows from this definition that 


h{t,u) < max{I?(t),for all t,u€T. 


(4) 


This definition means that 'D{t) is large when there exists some u which is far from t and 
which is preferred to t by the test suggesting that t is likely to be far from s, at 

least if s does belong to T. In order that this be actually true even if Pg does not belong to 
{Pt,t G T}, it is necessary to design suitable tests. It has been shown in [5] that one can 
build a special test between the two Hellinger balls B{t, r) and B{u, r) with r < h{t, u) /2 
(where B{t, r) denotes the closed ball of center t and radius r in the metric space {V, h)) which 
posesses the required properties. With this special choice of tests for all pairs (t, u), V{t) 
becomes indeed a good indicator of the quality of t as an estimator of s (the smaller P{t), 
the better t) and, more generally, of Pt as an estimator of Pg even if Pg is not absolutely 
continuous with respect to /r. This property of V suggests to define the following criterion 
on which to base a new VFCV procedure. Starting from the family of preliminary density 
estimators 

[sm,j = AmO^j), m G M, l< j <V^ , 

we build all the corresponding tests V't A . (^ 7 ) i hereafter denoted for simplicity by 

L,j 1 Tn,j 

between the densities sij and Sm,j for /,m £ A4, / 7 ^ m. Then, for each j and m, we define 
the criterion critj(m) by 


critj(m) = T’|(m) with 'Dj{m) = ^ h(szj, 


(5) 


We then naturally define our test-based V-fold criterion as 


1 ^ 

critTVF(”i) := 'B {m) = — for all m G M. 

^ j=i 

Up to our knowledge, this is the first V-fold type procedure based on the Hellinger distance. 
Note that this construction requires that the estimators Sm,j be genuine probability densities 
with respect to /r which we shall assume from now on. 


1.4 Organization of the paper 

Our goal is to study our new VFCV procedure from both a theoretical and a practical point 
of view. Section 2 is dedicated to its theoretical study. In Section 3 we discuss in details 
the implications of the resulting risk bounds to the case of histogram estimators, applications 
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to kernel estimators and an extension to algorithms that do not lead to genuine probability 
density estimators. Section 4 contains an empirical study of the influence of the value of V on 
the performance of our procedure in terms of Hellinger risk and also comparisons with classical 
V-fold and some especially calibrated procedures. Section 5 describes the fast algorithm that 
we have designed and implemented in order to compute the selected estimator efficiently. 
Finally Section 6 contains a proof of the bounds for the Hellinger risk of kernel estimators. 
We provide some additional simulations in Section A. 


2 T-V-fold 

As already mentioned, the method proposed in [7] is based on tests and it results in what 
Birge called T-estimators (T for “test”). We shall therefore call our cross-validation method 
based on the same tests T-V-fold cross-validation (TVF for short). 


2.1 Tests between Hellinger balls 

The tests that we use for our procedure satisfy the following assumption, which ensures their 
robustness. We recall that S is the set of all probability densities with respect to /j,. 

Assumption (TEST). Let 6 G (0,1/2) be given. For all t and u in S, z ^ M. and r = 
6h{t, u) there exists some test statistie depending on t, u, 6 and X with the following 

properties. The test between t and u defined by 


V’t.^x) = 


t If Tt^u,eO^) > ^ 
u if < z ’ 


z G 


with an arbitrary choiee when = z, satisfies 


and 


sup Ps ['i/t „(X) = u] < exp —n{l — 29ym{t,u) + z 

{Ps(^V\h(s,t)<r} ’ L 


sup Ps ['i/’t,u(X) = t] < exp —n(l — 20)^/i^(t, u) — z 
{ Ps&V I h{s,u)<r } 


where P^ denotes the probability that gives X the distribution P®"'. 
Any test satisfying (7) and (8) will be suitable for our needs. 


( 6 ) 

(7) 

( 8 ) 


Tests between balls In order to define tests between two Hellinger balls B{t, r) and B{u, r) 
with r = 6h{t,u), 0 < 9 < 1/2, Birge introduced the following test statistic 


7/,„,e(X) = ^log 

^=1 


/ sin(a;(l - 6>))\/t(Ai) -|- sm{uJ9)^/u{Xi) \ 
\ sin(a;(l - 9))y/u{Xi) -|- sm{uj9)y/i{Xi) ) 


with uj = arccos p{t, u). (9) 


We should notice that for 9 = 0, the test given by (9) is exactly the likelihood ratio test 
between t and u. The fact that Assumption (TEST) holds for this test whatever 9 G (0,1/2) 
has been proven in [6] and a more up-to-date version is to be found in [8, Corollary 1]. 
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2.2 TVF estimators 

Let {Am)meM denote some collection of positive numbers satisfying 

I 

> 0 for all m G M., and - < L = ^ exp(—A^) < oo. (10) 

m£M 

Starting from the family of estimators Sm,j defined in Section 1.3, we consider the correspond¬ 
ing tests with t = sij, u = Sm,j and z = Ai — Am in (6). This 

results in the estimator derived from the procedure Az, with 

TnxvF ^ mxvF 


^ • 7f;2 /X .1 

tutyf £ argmm [m) = argmm — 

mGj\4 F 


i=i 


mj 


( 11 ) 


2.3 Assumption on the family of procednres 

The idea of V-fold relies on the heuristic that, for each procedure Am, the observation of 
V partial estimators Sm,j, 1 < j < F based on samples of size n — p with p = njV allows 
to predict the behavior of an estimator Sm based on an n-sample. This requires that there 
exists a link between the loss of 'sm and the losses of the Sm,j- We shall need the following 
assumption on the collection of procedures we consider. 

Assumption (LOSS). For all procedures Am with m G Ai, the loss at s satisfies 

1 V' 

h ( S, Sm ) ^ ^ 'y ( h {s, Sm,j ) • 

^ i=i 


This implies in particular that R{Am, n, s) < R{Am, n — p,s), where 


R{A, k, s) = Es 


h^[s,A{Yk)) 


denotes the risk at s of the procedure A based on a sample of size k. Assumption (LOSS) is 
in particular satisfied by the “additive estimators” of [11, Chapter 10]. 

Definition 1. An additive estimator s = s(X) derived from a sample X of size n is an 
estimator that can be written in the form: 


1 " 

six) = — JCix, Xi) for all x G A, 

n ~ 

t=l 


( 12 ) 


where /C is a real valued function from A x A to M. 


There is a huge amount of literature about these estimators which already appeared in an 
early version in Whittle [21]. The first results about their asymptotic properties in general 
were made by Watson and Leadbetter [20], followed by Winter [22] and Walter and Blum [19] 
who established rates (the latter authors called them delta sequence density estimators). They 
were introduced in the context of CV by Rudemo [18] and used by Marron for comparison of 
CV techniques [15]. As shown in [19] and [11], additive estimators include in particular: 
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• Histogram estimators. Given a partition {1^; A G A} of A" with 0 < < +oo for all 

A one defines the histogram estimator based on this partition as 

It corresponds to the case of JC{x,Xi) = 

• Parzen kernel estimators on the line. Set JC{x,Xi) = w~^K [w~^{Xi — x)) for a given 
nonnegative kernel K with J^K(x)dx = 1 and a positive bandwidth w. Then (12) 
leads to a density estimator with respect to the Lebesgue measure on M. 

It is straightforward to check that if the procedure Am results in additive estimators, the 
following relationship which says that the estimator built with the whole sample is exactly 
the convex combination of the V partial estimators holds: 

1 ^ ^ 

Sm = y ^ (14) 

i=i 

As a consequence, we get the following elementary property: 

Proposition 1. Any proeedure Am whieh results in additive estimators does satisfy Assump¬ 
tion (LOSS). 

Proof. It follows from (14) and the concavity of the square root function that 


p(s, Sm) — P 



V 




’m,j 


J=1 



p(s, Smj), 


which is exactly Assumption (LOSS) in view of (2). 


□ 


2.4 The main result 

Assumption (LOSS) ensures that, for the procedures we consider, the loss of some estimator 
is bounded by the mean of the losses of the partial estimators. This motivates us to work 
separately on each split j G {1,..., P} and then to deduce a risk bound for the estimator 
built with the whole sample. It is therefore natural to study for each j the deviations of the 
random variable Pj(-)- A deviation inequality for 7? has been proven in Theorem 9 of [7]. Let 
us now recall it and provide a short proof for the sake of completeness. 

Proposition 2. Let (Am)meM ® eolleetion of weights satisfying (10) and 


A = 


n(l - 20)\ 
2P ’ 


ymj = max 




Then, for all m G A4, and j £ {1,... ,V}, 


Vj{m) > y 




< T exp —2Ay‘^ + 


for ally > ym,j- 
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Proof. Let us fix some m € M and j G {1,..., ^} and work conditionally to the training 
sample so that the collection of estimators can be considered as fixed. We 

perform the test that satisfy Assumption (TEST) with z = A; — Am in (7). Then 


V,{m) > y I 


3 I G M. such that h{'sij,Sm,j) > y and = I \ 


< 

< E 


exp 


=IIX^ 

2A.h ( Sij , Sm,j ) ( A; Am ) 


l^M-. h(sij,7mj)>y 

< exp -2Ay^ + Amj ^ exp(-A;) < T exp -2Ay^ + A^ 

leM 

where we successively used the fact that y > ym,j A d~^h{s,Sm,j) and ( 10 ). 


□ 


For each fixed j, that is conditionally to each X^, we deal with some “fixed geometrical 
configuration” since the points {sm,j)meM are given, conditionally to X^. On this configura¬ 
tion, Proposition 2 controls the deviations of 'D‘j{m) which allows us to bound the expectation 
of T> (m). This results in the following theorem. 

Theorem 1. Under Assumption (LOSS), the estimator = A^^^^^(X) with mTVF 


_ 2 

minimizing the eriterion T) (m) satisfies the following inequality: 


mxvF' 


E., 


h^ ( s,s^ ) 

V ’ mxvF / 


< inf < 2 
ra&M 


02+2 


R ( •^mi 


F- 1 


V 


-n,s + 


4F[A^ + log(2r) + l] 

n(l-20)2 


(15) 


Proof. Let m' be any point in Al. It follows from (4) that, for all m G Af and 1 < j < V, 
h{s,Sm',j) <h{s, Sm,j ) + h{ Sm',j, Sm,j ) <h{s, Sm,j) + max ( Vj (m), Vj (m')) . 
Setting m' = mTVF = m for short, we derive that 

E J ) ^ 2 I ^ /i2 (S, s™J ) + :^ ^ max ( V] (m), (m)) 




i=i 


< 2 


i=i 


V 


j=i 


V 


< yJ^h^(s,Sm,j) +4V (m), 

^ 1=1 

for all m G Al. Using Assumption (LOSS) and taking expectations, we derive that 

V 


Es {s,s-) < 2i?(Am,n-p,s)-hdE^ 

^ 1=1 

since the risk of Sm,j is the same for all j and equal to R{Am, n — p, s). 




(16) 
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Let now m and j be fixed. Integrating the bound for > y provided by 

Proposition 2 with respect to y leads to 


E., 






<yi,j+re^- 


'y\ 


r-l . 

^ e“2 ^ dz < + —^ exp 




and, since Aymj ^ ^r. 


Finally 


V'^Am) 


< E., 


'iP 

J 


+ rA ^exp(-Am) < ^E^ h'^{s,Sm,j) 


Am + Fe 


— Ari 


A 


E., 


V^(m) < ^R(Am, n-p,s) 


An 


One should then observe that changing Am into Am + B with B > 0 does not change the 
procedure since the tests only depend on differences Am — A;. Since the new weights Am + B 
also satisfy (10) with F changed to Fe“'®, the previous bound remains valid for the new 
weights leading to 


E., 


V^{m) < ^R{Am, n-p,s) + 


-Am-2B 


An optimization with respect to B (taking into account the fact that F > 1/2) together with 
(16) leads to our conclusion. □ 


2.5 Comments 

At this stage, several comments are in order: 


A simple case It is often the case that At is finite with cardinality |At| and that we use 
equal weights Am = A < log(21At|) for all m G At, in which case F = |At|e“^ which leads 
to the following risk bound which only depends on |At|: 


E., 


h'^ is, s. 


mxvF 


< 2 


02+2 


inf R 
m£M 



V-1 


-n, s 


4Flog(2e|At|) 
n(l - 20)2 ” 


Modified V-fold Unfortunately, there are actually many estimators, like maximum likeli¬ 
hood estimators or T-estimators, that do not satisfy Assumption (LOSS) and for which the 
previous risk computations fail. In order to solve this problem, one should think about the 
initial purpose of VF methods and, more generally, of procedure selection. Starting from the 
family {Am, m G A4}, we want to determine, at least approximately, the best procedure for 
the problem at hand. But if we design an alternative procedure A not contained in the initial 
set, but as good as the best one in the set, we may consider that we have achieved our goal. 

It should be noted at this stage that Assumption (LOSS) is only used to derive in (16) 


that 


E., 


A a; 


1 


V 


i=i 
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which, in view of Proposition 1, holds as soon as ^ natural solution 

to deal with any family of estimators that do not satisfy Assumption (LOSS) is therefore as 
follows. Define the partial estimators Smj and determine mTVF as before by (11), then define 
the final TVF-estimator stvf by 


STYF-V 

1=1 

so that (16) is satisfied and the proof proceeds as before; our modified TVF-estimator stvf 
satisfies the conclusion of Theorem 1. 


Extension It should be noted that the following analogue of (15) holds (with the same 
proof) 


E., 


I ’ mxvF / 


< 


inf 

ra&M 


Ci{e,a)R 


V- 1 


V 


re, s) -|-C' 2 ( 6 ',a) 


V(A^ + log(2F) + 1) 


re 


if we replace Assumption (TEST) by the following 

Assumption (TEST’). Let 6 G (0,1/2) and a > 0 be given. For all t and u in S and 
r = 9h{t,u) there exists some test statistic Tt^u,e{^) depending on t,u,9 and X with the 
following properties. The test Tpt,u between t and u defined by 


t if ^ 

u if Tt-afiiX) < z ’ 


2 ; G 


with an arbitrary choice when satisfies 

sup Ps [V’t,u(X) = re] < exp —nahfi{t,u) + z 
{ Ps&V I h{s,t)<r } 


and 


sup Ps [fit,uO^) = t] < exp —nah‘^{t,u) — z 
{Ps^P I h{s,u)<r} 


where P^ denotes the probability that gives X the distribution P®"'. 


In particular Baraud introduced in [3] and for the same purpose of estimator selection 
the following statistic that relies on a variational formula for the Hellinger affinity. For 
r = {t + u) /2, let 


7/,n(X) 


1 /1 " yijXi) - ^(Xj) 
2[n^^ ^(X,) 


+ 


- \Ju{x) ] dfj,{x 


(18) 


The corresponding test fit,u actually satisfies Assumption (TEST’) for small enough constants 
9 and a. This follows from Baraud (2008, unpublished manuscript). Therefore the test 
fi{t, re) derived from Baraud’s statistic could be used instead of the tests between balls. Some 
simulations based on this alternative test will be provided in Section A. 
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3 About the choice of V 


Let us now come back to the bound (15). It follows from our empirical study in Section 4.2 
below that a good choice of 0 is 1/4. Therefore assuming, to be specific and for simplicity, 
that 0 = 1/4 and that 

log(2r) + 1 < 3Am for all m G A4, (19) 


(15) becomes 


E., 


^ s,. 


mj'Yp 


< 66 inf 
m&M 


R A. 


V-1 


V 


-n, s + 


FA, 


n 


( 20 ) 


Although this risk bound is certainly far from optimal in view of the large constant 66 and 
our extended simulations show that the actual risk is indeed substantially smaller, it is never¬ 
theless already enlightening. To see it, let us begin with the simple case of regular histograms. 


3.1 Regular histograms 

Let us analyze the problem of estimating an unknown density s with respect to the Lebesgue 
measure on [0,1] from n i.i.d. observations with density s. We consider, for each positive 
integer m, the histogram estimator Sm based on the partition Zm of [0,1] into m intervals of 
equal length m~^. It is known from [10, Theorem 1] that the risk at s of the histogram Sm 
built from n i.i.d. observations is bounded by 


E., 


h (s,Sm) ^ h (s,Sm) T 


m — 1 
2 n ’ 


( 21 ) 


where Sm is the L2-projection of s onto the m-dimensional linear space of piecewise constant 
functions on the partition Zm- It is also shown in this theorem that this bound is asymptot¬ 
ically optimal, up to a factor 4, since the asymptotic risk (when n goes to infinity) is of the 
form 


E,, 


h^(s,Sm) =h^(s,Sm) + 


m — 1 
8n 


(l + o(l)). 


( 22 ) 


In view of (22), the bound in (21) can be considered as optimal, up to a constant factor and 
it follows from (21) that 


R 



V-1 


V 


-n, s 


< (s, Sm) + 


(m — 1)V 
2n{V - 1) 


h‘^{s,Sm) 


m — 1 m — 1 
2n 2n{V - 1) 


(23) 


and that 


inf Es 
m£A4 




) 


< {s, Sm* ) + 


{m* - 1) 
2 n 


inf 

m£M 


h ( S, Sm ) T 


(m — 1) 

2^ j ’ 


(24) 


where this last bound can be considered as a benchmark for the risk of any selection procedure 
applied to our family of histograms. Since the Hellinger distance is bounded by 1, it clearly 
appears that one should restrict to values of m that are not larger than 2n. We shall therefore 
now assume that M = {1,2 ,..., 2n}. 
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Applying (23) to (20), we get 


66 




f S, Sr; ) 

’ rriTVF J 


< inf I R (Am, , ^ n,s\ + 
m&M I V V In 


< inf ( (s,Sm) + 
m&M i \ 


m — 1\ ^ f m — 1 ^i/A^ 


2n J V2n(i/-1) 


n 


< 


( S, Sm* ) + 


{m* - 1) 
2ti 


+ 


m* — 1 V /S.r 
+ -^ 


2n{V -1) 


n 


(25) 

(26) 
(27) 


with m* defined by (24). We see from (26) that, up to the multiplicative constant 66, we 
have to optimize with respect to m a bound for the risk of Sm plus a residual term which 
depends in a non-monotonous way of V. The bound (27) shows that, up to a constant factor, 
we actually recover our benchmark (24) plus an error term which writes 


g{V-l) with g{x) = -f ^^—+ — 

n \ 2x J n 


Clearly, g{x) is minimum for x = xq = \J{m* — l)/(2Am*)- It follows that the optimal value 
of V is two if m* — 1 < 2Am*- This occurs in particular if m* = 1, for instance when Pg is 
the uniform distribution on [0,1] or close enough to it. It also occurs if Am > {m — l)/2 for 
all m >2. 

Let us now consider the situation for which m* — 1 > 2Am* so that xq > 1 and the optimal 
value of V belongs to (xq — 1,xo + 1). If (m — 1)/Am is an increasing function of m, the 
optimal value of V will be a non-decreasing function of m* which, as m* does, depends on 
the true unknown value of s, large values of m* leading to large values for V and vice-versa. 
For instance, the choice of equal weights. Am = log2n for m € M leads to F = 1 which 
satisfies (19) and to an optimal V of order y/{m* — l)/(21og2n). But this choice of Am is 
certainly not optimal in view of (25). A better one would be Am = (1/3) -|- 21ogm which 
also satisfies (19) but improves (25) substantially. Then the optimal value of V is of order 
y/(m* — l)/((2/3) -|- 41ogm*), still depending on the true unknown s. Only larger values of 
Am of the form Am = a{m — 1) for m > 2 that deteriorate the bound (25) and therefore 
should not be recommended lead to an optimal value of V which is independent of m *, hence 
of s. 


3.2 The typical situation 

A risk bound of the form (21) is actually not specific of histograms but actually rather typical. 
There are many procedures for which the risk, for a convenient choice of the index m and of 
the set A4 C M can be bounded in the following way: 


E, 


{s,Sm) < H{s,m) + Cmn 


-1 


(28) 


where 77 is a nonincreasing function of m, leading to an optimal choice m* for m (with respect 
to this bound which we take as a benchmark for the risk) given by 

m* = argmin^g_y\^ I H{s,m) + Cmn~^ >. 
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It then follows from (20) that we get an analogue of (27), namely 


66 




f s, S'- ) 

’ mTVF J 


/ • r J rr. ^ , CmV , 
< inf <H[s,m)-\ 7 V 7 — 7 T + 


mSA^ 


n{V -1) 


n 


< 


H{s,m*) + 


Cm* 


n 


1 

H— 
n 


Cm* 

1 


+ {V — l)Am* 


+ 


A, 


n 


and we see that the choice of V is driven, as in the case of regular histograms, by the quantity 

(V - l)-^Cm* + {V- 1)A^*. (29) 


The same arguments as before show that the optimal choice of V then depends on the ratio 
m* j Am* and therefore on s in many situations. This dependence of the optimal value of V 
with respect to the true density s will actually be confirmed by our simulations below. A 
density which is difficult to estimate by a histogram with a few bins will lead to a large value 
of m* hence a large optimal V while a simple density, for which m* is rather small, is better 
estimated by a V-fold with a small V. In the case of a finite set A4, which is the practical 
one, and of equal weights, which is the simplest but suboptimal choice, the optimal V varies 
like m*. 


3.3 Kernel estimators 

We consider here estimation of an unknown density s by a kernel estimator s^, using a non¬ 
negative kernel K and a positive bandwidth w which means that 

n 

Sw(x) = ^ IC^^(x - Xi) with Kia(y) = w~^K {w~^y^ . (30) 

i=l 

Although there are many papers which study the performance of kernel estimators, in par¬ 
ticular their risk with respect to Lp-type losses, we were unable to find a result about their 
non-asymptotic risk with respect to the squared Hellinger loss. This is why we provide one 
below, the proof of which is deferred to Section 6. 

Theorem 2. Let s be a density on the real line whieh is supported on an interval of length 
2L and sueh that has an L, 2 -niodulus of eontinuity 

<^2 (\/s, f]) = sup ||\/s (• -I- z) — -v/s|| = sup /i (s(- -|- z), sj . (31) 

\z\<ri \z\<ri 


Let 4> be a nondeereasing and eoncave funetion on [0, -|-oo) with (f){0) = 0 and (\/s, r\) < cfij]) 
for r] > 0. Assume moreover that the kernel K is bounded with f x^K{x) dx < -l-oo and that 
it is ultimately monotone around —oo and -|-oo. Then the kernel estimator Su, given by (30) 
satisfies 


E., 


h (su,, s) 


< 2 


y V K{x) dx 




2L\\K\\ 


+ 


C{K) 


(32) 


nw n 

where the eonstant C only depends on the kernel K and is equal to 1 when K is unimodal. 
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If we restrict to densities s with a known compact support, this bound takes the form 
(28) with the choice m = 'w~^ + 1. A “classical” smoothness assumption on ^/s corresponds 
to the choice </>(?/) = for some exponent a G (0,1]. In this case the smallest quantity 

M such that UJ2 Tj) < (f>{rj) holds true is merely the Besov semi-norm of y/s in the Besov 
space -B2 oo- In such a case, we see that the optimal value of ry is of order leading 

to a risk bound of order 77,-20/(20+1) _ jg completely analogous to what we get for the 

squared L2-risk, apart from the fact that for Bellinger we put the smoothness assumption on 
y/s instead of s. 


3.4 Handling arbitrary estimators 


The previous construction of TVF-estimators is only valid for genuine preliminary density 
estimators Sm, that is such that Sm(x) > 0 for all x G A and / Sm{x) dfj,{x) = 1, but this 
is definitely not the case for all classical estimators. For instance additive estimators given 
by (12) do not satisfy these requirements when the function JC may take negative values. 
This actually happens for projection estimators derived from wavelet expansions or kernel 
estimators based on kernels that take negative values. Not only TVF-estimators cannot be 
built from preliminary estimators that take negative values but the Bellinger distance cannot 
be defined for such estimators since it involves the square roots of the densities. There is 
actually a simple and reasonable solution to this problem which is to transform any function 
t such that > 0 into a probability density 7r(t) with respect to /r using the following 

operator vr: 


7r(t) = 


t VO 


(33) 


/ {t{x) V 0) dfi{x) 

It is clear that for any probability density s, |s(x) — {t{x) V 0)| < |s(x) — t{x)\ so that t V 0 
is closer from s than t for any reasonable distance, including all Lp-distances. Moreover 
the following lemma shows that h{s,TT{t)) < y/s — \/i V 0 which justifies the use of the 
transformation vr when dealing with the Bellinger distance. 

Lemma 3. Let f,g be two nonegative elements o/L2(/u) with \\f\\ = 1 and ||(7|| > 0. Let 
9 = S'/llsll so that ||g|| = 1. Then 

If s is a density with respeet to fi, g a nonegative element o/L2(//) with positive norm and 
u = (5/||<?||)^, then u is also a density with respeet to p, and 

h‘^{s, u) < 1 — — (\\Vs “ 1 ) < ||\/s — </||^ A 1. 

If, in particular, t is an arbitrary element o/Li(yr) such that f {t y 0) dp > 0, then 


hf{s, 7r(t)) < 1 — ^ -v/s — Vi V 0 A < y/s — \/t V 0 


A 1. 


Proof. Let ||g|| = A so that g = Xg and let p = (/,g) G [0,1]. Then 

11/- fflP = 1 + - 2A/3 and ||/- = 2(1 - p) < 2 . 
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It follows that, for a given value of p, the minimum value of ||/ — p|P is obtained for X = p 
and equal to 1 — which implies that 


/II/ -5 


-II \ 2 


< 


< 2. 


VII/- 5 II/ 1 + P 4-||/-p||2 

If / = ^/s, then p = p(s, u) = 1 — h?{s, u) and ||/ — g\\^ = 2h‘^{s, u), so that (34) becomes 

y-f 

1 + p(s, u) 2 — h^{s, u) 

and, since h{s,u) < 1, it also follows from elementary calculus that 


(34) 


h‘^{s,u) < 1 - yjl - (11/ - pP a 1) < 11/ - gf A 1. 

The last inequality is just the case of p = Vt V 0. □ 

Using the transformation vr amounts to replace the initial family {Am, m G by a new 
one {A'm, m G JXi} via the tranformation A'mO^k) = (^-4.m(Yfc)^ which results in procedures 

that now make sense for the Hellinger loss. Unfortunately, this transformation does not 
preserve the linearity so that if we apply this recipe to projection or kernel estimators, we 
cannot know whether the transformed estimators satisfy Assumption (LOSS). Nevertheless, 
as we have seen in Section 2.5, we may change the definition of TVF-estimators to (17) in 
order to solve this problem. 

Starting from a family of estimators that are not probability densities, we merely begin 
with a preliminary application of the transformation vr, as given by (33), and then define 
our modified TVF-estimator via (17) so that Theorem 1 applies to the family of procedures 
{A'm, m G M]. 


4 Empirical study 

The theoretical bounds that we have derived, for instance (20), are quite pessimistic because of 
the large constants that are present in our risk bounds. It is therefore crucial to know whether 
such large values are only artifacts or really enter the risk. In order to check the real quality 
of our selection procedure and evaluate the influence of the various parameters involved in it, 
we performed an extensive set of simulations the results of which are summarized below. 

4.1 Simulation protocol 

We studied the performances of the TVF procedure on 18 out of the 28 densities described 
in the benchden ^ R-package [17] which provides a full implementation of the distributions 
introduced in [4] as benchmarks for nonparametric density estimation. We only show our 
simulations for the eleven densities in the subset C = {sj, i = 1,2,3,4, 5, 7,12,13,22, 23,24} 
(where the indices refer to the list of benchden) the graphs of which are shown in Figure 1, 
except for the uniform density si on [0,1]. For a given loss I = h‘^,di or (respectively 
the squared Hellinger, Li- and squared L2-losses), we decided to evaluate the accuracy of 

^Available on the CRAN http://cran.r-project.org. 
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Figure 1: Graphs of all densities mentioned in the paper apart from the uniform. 

some estimator S' = by empirically estimating its risk R{s,s,£) = [1^(5, s)]. To do so, 

we generated 1000 pseudo-random samples X* = {X\,... 1 < i < 1000, of size n and 

density s and approximated R{s,s,i) by its empirical version: 

1000 

Rn{s,S,£) = — J2e[s,^X^)). 

i=l 

As in [14], we considered several families of estimators. We present here our simulations 
for the well-known problems of bandwidth selection for kernel estimators with a Gaussian 
kernel and the choice of the bin number for regular histograms. We therefore introduce the 
following families of estimators. 

• J-R is the set of regular histograms with bin number varying from 1 to [n/logn] as 
described in [10], 

• Tk is the set of Gaussian kernel estimators with bandwidths Wm of the form 

1 / 1 5 \ 

Wm = —, - 1 + ’ for m = 1,..., (logn)^ , 

n log n V log n / 


• -Fkr = Jk U Jr- 

Besides the classical VF methods, we considered two alternative procedures that are known 
to perform well in practice in order to have an idea of the performance of the T-V-fold as 
compared to some especially calibrated methods. When studying the problem of bandwidth 
selection, we compared the TVF with the unbiased cross-validation selector, implemented 
in the density generic function available in R, which provides an estimator which does not 
belong to the set {sm, m G M}. When dealing with the bin number choice we implemented 
the penalization procedure of Birge and Rozenholc (described in [10]) which selects a regular 
histogram in Jr. These two competitors will be denoted “UGV” and “BR” respectively in 
our study. To implement the TVF and process our simulations we used an algorithm which 
is described in Section 5 with the tests defined in (9) and constant weights = A = 0 for 
all m G Ai. 

We made thousands of simulations (varying the sample size n, the density, the family 
of estimators, the number V of splits in our V-fold procedures, etc.) but since the results 
we found were very similar in all situations, we only show the conclusion for n = 500 and 
V = 2,5,10 and 20. 
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4.2 The influence of the parameter 6 

As in [14, Section 5.1], we have studied the influence on the performance of the TVF procedure 
of the parameter 0 that is used in the definition of the test statistic (9). The parameter 
influences the performance of the tests as shown by (7) and (8) and therefore the whole 
procedure. Since on the one hand 0 = 0 corresponds to the KLVF and on the other hand 9 
must be less than 1/2, we made comparisons between the versions of stvf deduced from the 
tests with 0 G 0 = {1/16,1/8,1/4,3/8,7/16}. For the sake of clarity and to emphasize the 
stability of the behavior of the procedure in terms of risk, we present for each V the ratio 



which gives the largest difference in terms of risk among the densities in jC. The closer the 
ratio to 1, the more stable the procedure with respect to the variations of 0. We may conclude 


family 

V = 2 

V = 5 

V= 10 

V = 20 

Tr 

92,95 

94,87 

96,39 

96,96 

Tk 

91,31 

92,94 

94,79 

96,44 

Tkr 

87,81 

94,36 

97,48 

95,15 


Table 1: 100 times the ratio (35) for n = 500 and families Tk and Tkr- 

from this picture that 0 has little influence on the quality of the resulting estimator for families 
Tk and Tr, even if we did observe that 0 = 1/16 is in general slightly worse than the other 
values (in particular for the family Tr). Considering family Tkr, we have observed that there 
might be some noticeable difference for V = 2 for one specific density. Nevertheless no clear 
conclusion can be derived from our simulations as the best value of 0 varies with the setting. 
Finally, it appears that the choice 0 = 1/4 is always satisfactory and should be recommended. 

4.3 About the choice of V 

The main question when considering VF type procedures is maybe “which V is optimal?” or, 
more generally, “what is the influence of V on the quality of the VF procedure?”. According 
to our theoretical study in Section 3 the optimal value of V depends on the optimal value 
m* of m. In the case of equal weights the best V appears to be an increasing function of m*. 
In the case of histograms, if the best one has many bins, one should take a large value of V 
and the same would hold for a kernel estimator with a small bandwidth. To understand what 
actually happens in practice, we study here how the risk of the chosen estimator behaves 
when V varies. 

Since 0 has little influence, we made all the simulations with 0 = 1/4. We also implemented 
the calibrated procedures BR and UCV described in Section 4.1 in order to have a benchmark 
for the risk for the families Tr and Tk respectively. 

The empirical results summarized in Table 2 actually confirm what we derived from (29). 
The quality of the estimation increases with V when the true density is difficult to estimate 
which corresponds to an optimal estimator Sm* with a large value of m* in (29). For a simple 
density like the uniform si which is better estimated by an histogram with few bins, the best 
choice of V is 2 for the families Tr and Tkr which include histograms. On the contrary, 
when dealing with the family Tk for which si is not easy to estimate, we need to use a larger 
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family 

V 

Si 

S2 

S 3 

S4 

S 5 

sy 

S 12 

Sl3 

S 22 

S 23 

S 24 


2 

2,9 

10,4 

9,29 

13,8 

10,9 

11,4 

17,9 

14,5 

10,5 

20,8 

27,5 


5 

4,31 

9,9 

8,75 

12,7 

10 

10,6 

17,3 

13,5 

9,56 

18,4 

25,2 


10 

6,18 

9,81 

8,64 

12,3 

9,77 

10,6 

17,2 

13,7 

9,51 

17,8 

24,8 


20 

9,39 

9,65 

8,54 

12,2 

9,59 

10,4 

17,3 

14,1 

9,28 

17,9 

24,8 


BR 

2,20 

9,94 

9,27 

12,98 

10,53 

11,14 

17,85 

14,63 

10,37 

17,98 

25,15 


2 

15,4 

29,9 

5,67 

5,1 

3,56 

4,26 

28,5 

20 

3,96 

10,6 

18,1 


5 

12,7 

25,5 

5,06 

4,95 

3,61 

3,98 

23,4 

18,1 

3,86 

9,28 

16,2 

Rk 

10 

12,4 

24,3 

4,94 

5,01 

3,96 

4,04 

21,8 

17,7 

3,91 

9,08 

15,8 


20 

12,2 

23,5 

4,97 

5,41 

4,9 

4,27 

20,9 

17,6 

4,11 

9,05 

15,7 


UCV 

15,86 

22,20 

5,57 

6,16 

3,74 

4,10 

18,80 

17,16 

3,88 

9,52 

15,91 


2 

2,88 

10,4 

8,32 

6,35 

5,81 

6,57 

18,5 

14,4 

7,3 

12,8 

20 

Rkr 

5 

4 

9,91 

7,86 

5,64 

5,11 

6,06 

17,7 

13,2 

5,76 

9,66 

16,7 

10 

4,34 

9,95 

7,66 

5,64 

5,4 

6,18 

17,6 

13,7 

5,82 

9,12 

16 


20 

4,34 

9,86 

7,49 

5,91 

5,81 

6,5 

17,5 

14,5 

5,88 

9,08 

15,7 


Table 2: 10^ times the Hellinger risks of the TVF procedure. 


value of V. A similar situation occurs with densities S 4 , S 5 , sj and S 22 which appears to be 
easily estimated by a kernel estimator with a large bandwidth but poorly by histograms. It 
seems that, apart from the exceptional situation of si, the best value of V is not 2 and the 
most significant gain appears between V = 2 and V = 5, then the quality sometimes keeps 
improving from V = 5 to V = 20, but with very little difference between V = It) and V = 20. 

Interestingly, we also observe that when using the mixed collection Tkr the TVF pro¬ 
cedure shows a good adaptation behaviour since it selects the best family in all settings. 
For instance for S 5 it chooses a kernel estimator since these are better than histograms for 
estimating it, whereas it selects an histogram for S 2 for the opposite reason. 

The numerical complexity of the TVF procedure is quite important in practice and in¬ 
creases with V so that large values of V should be avoided because they lead to a much larger 
computation time. In particular the Leave-one-out (V = n) should be excluded since it is typ¬ 
ically impossible to compute it in a reasonable amount of time. Of course, since the optimal 
value of V, as we have seen, depends on unknown properties of the procedures with respect 
to the true density it is not possible to practically define an optimal choice of V. Neverthe¬ 
less our empirical study suggests that a good compromise, which leads to both a reasonable 
computation time and a good performance (apart from some exceptional situations like the 
estimation of the uniform by histograms), is V = 5. We would therefore recommend the user 
to process the TVF procedure with this value. 


4.4 Comparison with others VF procedures 


The goal of this section is to compare our TVF procedure with other general VF procedures 
namely LSVF and KLVF, which do not depend on the family of estimators from which we 
estimate s. In order to compare two VF procedures ti and t 2 , we consider the log 2 -ratio of 
their empirical risk. 


Ws{hj2,i) 


^ Rn{ii,S,i) 


Thus VFs (ti, = c means that (ti, s, £) = 2^^ x Rn {i 2 , s,i). Hence, for a given density 
s, ^2 is a better estimator than H if c > 0. In our empirical study, a selection procedure £2 
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is thus considered better than ii in terms of risk for a given loss function i if the values of 
Ws{ii,i 2 ,i) are positive when the density s varies in C. 

Rather than presenting exhaustive results, that is the evaluation of Ws for all densities s 
in C, different loss functions, various observations numbers n and different choices of V, we 
shall illustrate the results of our simulations by showing boxplots of {Ws{ti,t 2 , i), s G 
with the discriminating value zero emphasized in red. We actually observed similar results 
and behaviours for all losses and all sample sizes and therefore present here only the results 
for i = h'^ and n = 500 for the sake of simplicity. Figure 2 is built with ii = ^^sve 
line) or (bottom line) and ^2 = ^mxvF 6 = 1/A. 




Figure 2: From left to right, the boxplots lFs(s,s^ , ii^) using families .Fk,.Fr, J^kr (up for 
s = , down for s = ). Each subfigure shows the boxplots for V = 2,5,10 and 20. The 

horizontal red dotted line indicates the reference value 0. 

In nearly all cases, the median and most of the distribution are positive, which means 
that the TVF outperforms LSVF (with an average gain of about 20% for the three families 
of estimators Fk, Fr and Fkr) and KLVF as well. For the collection Fk we observe that 
the empirical risks of TVF and KLVF are similar with boxplots of ^s(®mj,,LVF’^”1 tvf’ 
highly concentrated around zero. But there is a huge difference between TVF and KLVF 
procedures for families Fr and Fkr (average gain of about 100% and 180% respectively). 
For the uniform density estimated with regular histograms, the estimator derived from our 
procedure is worse since we found, for both classical VF, IFs^(S, /i^) < 0 (with an 

increasing difference with V for Fr). Finally, let us notice that the difference between TVF 
and classical VF does not change much with V. 
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5 Our computational algorithm 

For the practical computation of the TVF as well as any other VF procedure, we assume that 
A4 is finite with cardinality M. 

Let us compare the complexity of a classical F-fold method with ours. Since for every 
VF method the construction of all partial estimators (smj)i<j<v,i<7n<M is required, we only 
have to focus on the “validation part” which requires to compute all quantities 'D‘j{m) for 
1 < J < 1^ and m € M and therefore to perform all tests for 1 < j < V and 

l,m € M with I 7 ^ m. This means performing V x M x (M — l)/2 tests leading to a 
computational cost of order 0{V x M^) that can be prohibitive as compared to the one of 
either LSVF or KLVF which have a maximum complexity of order 0{V x M) (since in this 
case no more than M calculations are needed for each split). For instance, a 10-fold with 
100 different procedures would require at most 1000 evaluations for a classical VF whereas 
we would need the computation of 49500 tests for the TVF. It is already huge and does not 
even take into account the computation of the distances Smj), each one requiring the 

evaluation of an integral. Therefore a “naive” algorithm based on the computation of all the 
V X M values would be very slow. 

Fortunately, there is a smarter way to determine which m minimizes T){-) over A4. Our 

algorithm is inspired in some way by the one described in [14, Section 3]. In order to explain 

how this “fast” algorithm works, it will be convenient to single an element of A4, that we 

shall denote by “m^”, to serve as a starting point for our algorithm which begins with the 

computation of T){ms). We store in R the minimal value of those V (m) that have already 

been computed and in opt the corresponding optimal value of m with initial values opt = rus 

_ 2 _ 2 

and R = V {ms). We update them after each computation of a new T) (m) such that 

T> (m) < R, then setting opt := m and R\=T) {opt) so that R can only decrease during the 

computational procedure. 

By (11), minimizing R {m) is equivalent to minimizing ^^]^I?|(m). Since 
T)j{m) = sup (sij,Smj) ]1|.0 (X)=«} Mm = M\{m}, 

i&M m 

one can compute it iteratively, starting with Cj{m) = 0 and setting 

Cj{m) := max (^Cj{m),h^ {sij,Sm,j)'^ when ipi^rn{^j) = I for I £ M-m- 

If '4’i,m{^j) = m we can instead update Cj{l) by Cj{l) := max{Cj{l),h‘^{sij,Sm,j)) using 
the result of the test i/ji^rn^^j) for the calculation of both 'D‘j{m) and Rj{l)- Our algorithm 
proceeds in this way, with a set of M V-dimensional vectors C.{m), m £ M, initially set to 
zero. The updating procedure of Cj{m) stops when all updates, with I £ Aim, have been 
done (which means that the present value of Cj{m) is T>|(m)) and we finally set R {m) = 
Ej=iC,{m). 

We also use another trick in order to shorten our computations. Since Cj{m) can only 
increase during the updating procedure, ^j{'m) is, at any time, a lower bound for R (m), 
whatever m £ Ai. Therefore it is useless to go on with the computation of the vector C.{m) 
if >Cj(m) > R since then R {m) > Y^=i >Cj(m) cannot minimize the function R{-) over 
Ai. Taking this fact into account, we denote by ^ C A4 the set of all procedures which are 
potentially “better” than the current optimal one stored in opt. This means that we store in 
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_2 

^ all m G for which we do not yet know whether T> (m) < R oi not and each time we 
find m such that Y^=i remove it from Q. We also remove m from Q once 

we have computed T> (m) with m ^ Q and then proceed with the computation of some new 
vector C.{1) for I G Q until Q is empty and the algorithm stops with the final value m = opt. 

Algorithm 1: Selection of the TVF estimator 

Initialization : 

1 Set Q = Aims and opt = 

2 for {I e A4) do 

3 for (j = I, . ■ ■ ,V) do 

4 I £,(0=0 

5 end 

6 end 

1st step: 

7 for {I e Q) do 

8 Compute ’ipms.ii^j) 

9 if {ipms.iO^i) = ms) then 

10 I £.j{l) = h {si.jySms.j) 

11 else 

12 I Cj(ms) = max(Lj(ms),h'^(sij,'sms,j)) 

13 end 

14 end 

15 Set R = and e = e \ {Z e e : A (0 > R} 

Next steps: 

16 while {\ 0 \ > 0) do 
Choose m £ Q and set Q = Q \ {m} 
for (j = 1,..., V") do 

for (Z e Aim) do 

Compute tpm.ii^j) 11 if it has not been done yet 

if {ipm,i{^j) = m and I £ Q) then 

(0 —max(£,' (Z), .j^ ^m.j)) 
if (X]ili'^i(0 > R) then 

I G = g\{i} 

end 

end 

if = Z) then 

£j(m) = max(£j(m), Zi^(?!.j Am,,)) 

if > R) then 

I break // quit the two "for" loops 

end 

end 

end 

end 

if (E,'li AM < R) then 

I Set opt = m, R = ttj{m) and Q = Q\{1 £ Q ■■ A(0 > 

end 

38 end 

39 Return opt 

Some important remarks 

• The algorithm is designed to work with any test procedure ^ which satisfies Assumption 
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(TEST) or, more generally, Assumption (TEST’), like the procedures based on the 
statistics (9) or (18). 

• It is important to notice that, at any step, we cannot “delete” once and for all the 
procedures which do not belong to the set G. Even if we do not compute the value of D 
for these procedures, we still need to test them against the remaining procedures in G- 

• We hoped that by starting from a good initial estimator, only a few procedures would 
be in the first set G, resulting in just a few tests. In the simulations we always started 
from nis = m-LgvF since the computation of mnSYF is less costly than the one of mTVF 
and provides a good starting point. If T’(mLSVF) = 0 at the first step the algorithm 
stops immediately and the chosen procedure is m = fhhSYF- In tbis special case, the 
complexity of our algorithm is the same as the one of the classical approach. 

• Clearly, the choice of m at line 17 of the algorithm, as well as the choice of the starting 

procedure, have no influence on the final estimator, only on the computational time. 
To avoid a quadratic complexity, we need to ensure that we don’t “jump” to the worst 
procedure inside the set G at each iteration. In our simulations, we chose to jump to 
the statistical method k € G with the lowest temporary criterion among the procedures 
in G, that is k = argmin^g^ We also tried two alternative options: jumping 

to k = argmax;gg nnd to the most chosen statistical method A: in ^ against 

m. Both options lead of course to the same final estimator but were definitely slower. 


6 Proof of Theorem 2 


First note that the kernel estimator can be written, according to (30), where 

Pn = denotes the empirical measure based on the i.i.d. sample Ai,...,A„. It 

then follows from the triangle inequality that 


E., 


h (^Syj , s) 


= -lEs 

2 


a / Ku, * Pn - Vs 


< 


Vs — VP-w * s 


+ Ts 


V Kw * Pn — V Kw * s 


(36) 


which is the usual bound of the risk as squared bias plus variance, and we shall bound both 
terms successively. 


6.1 Bounding the bias 


It is well known that whenever the function s belongs to L 2 , the quality of approximation of 
s by the convolution * s depends on the modulus of continuity a; 2 (s, •) of s in L 2 as given 
by (31). If we consider the Bellinger distance instead of the L 2 -distance it is expected that 
the quality of approximation should rather depend on the the modulus of continuity of -y/s 
instead. The control of the bias term is provided by the following lemma: 


Lemma 4. Let s and K be some density funetions with respeet to Lehesgue measure on the 
real line. Let g = V^ and assume that uj 2 {g,rj) < for every nonnegative rj and some 
nondeereasing and coneave function cf on [0, 00 ) with (j){0) = 0. Then for every positive real 
number w 


-VK 


* s 


< 2 


y V K{x) dx 


(V{w) 


(37) 
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Proof. The key of the proof is to compare = g — ^2 with A'^ = \\g - Kw * g\\‘^. 

Our arguments are easier to explain within a probabilistic framework. Let ^ be some random 
variable with density K with respect to the Lebesgue measure. Then the convolution operator 
can be written as 

{Kw * g){x) = IE[gr(x — for all x G M. 

From Jensen’s inequality we know that 


E[g{x - rcC)] < [g‘^{x - w^)], 

or equivalently a/* g^ > * g, and a fortiori, 

J g{x)^{K^*g^){x) dx > J g{x){K^ * g){x) dx. 
Expanding the square norms, we derive from (38) that 


(38) 


- A^ < 




- \\K^*gr 


The trick is to notice that 


_ 2 p 

^jK^*g‘^ - \\Kw * g\\^ = / Var {g{x - w^)) dx. 

«/M 

Now since the computation of the variance is not sensitive to the substraction of a constant 
Var (^g(x — = Var (^g(x — wf) — g{x)j < E (^g{x — wf,) — g{x)'j 

and Fubini’s Theorem implies that 




- \\K^ *gf < E uj2{g,w\^\) 


This achieves the first step of the proof. The second step is straightforward. We just have to 
bound which is an easy task since 


= 


J^(E[g{x - w() - g{x)]J dx 


implies by Jensen’s inequality and Fubini’s Theorem that 


A^ < E 


g{x — wff) — g{x)\ dx 


< E 


^l{g,w\^\) 


Collecting these bounds, we derive that 


< 2E 


^l{g,w\i\) 


< 2E 




It remains to decouple w and in the last expression above. This can be done by noticing 
that the monotonicity and concavity properties of cj) imply that 4>{w\f^\) < (1 V \f,\)4>{w) and 
the result follows. □ 
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6.2 The variance term 

We now turn to the analysis of the variance term of the Hellinger risk of a kernel estimator. 
Lemma 5. Let us denote by the Borel set {x G M | {K^ * s){x) > 0}, then 


Eo 


V Kw * Pn — \/ Kw * S 


< 


1 


{{K'^)w * s){x) 


nw Ja^ {Kw*s){x) 
In partieular if s is supported on an interval of finite length 2L, then 


E, 


V Kw * Pn — \l Kw * S 


< 


1 


sup K 


nw J -l<z<l 


x — z 
w 


dx. 


dx. 


(39) 


(40) 


If the kernel K is bounded, non-deereasing on (—oo,Mi) and non-inereasing on (M 2 ,+oo) 
with Ml < M 2 , then 


sup K 

—L<z<L 


X — Z 


w 


dx 


< 2L||i("||oo + w 


/ Ml poo 

K{x) dx-\- K{x) dx 
-00 J M 2 


(41) 


If, in partieular, K is unimodal, then 


E, 


Proof. Since, for u,v >0, 


\/ Kw * Pn — V Kw * S 


“ nw n 


{u - u)^ = {\/u - y/vf ( a/u + y/vf' > V {^/u - a / u )^ , 


it follows that 


/ 




2 

dx < 


[u{x) — u(x)]^ 


/v{x)>0 


v{x) 


dx + 


lv{x)=0 


u{x) dx, 


hence 


V Kw * Pn — 'J Kw * S < [ 

JA 


[(Ku, * Pn){x) - {Ku, * S)(x)]' 


{Kyj * s)(x) 


dx-I {Kuj * Pn){x) dx. 
JA9.. 


By Fubini and the definition of 


E, 




(i^^ > 1 = Pn){x) dx 


[ Es[{K^ * Pn){x)]dx = [ {K^ * s){x) dx = 0. 
JAz JAz 


Taking expectations and using Fubini again, we therefore get 


E, 


V Kw * Pn — V Kw * S 


Eo 


< 


[{Ky, * Pn){x) - Kyj* s(x)]' 


{Kyj * S(X) 
Var((K^*P„)(x)) 


dx 


Kyj * s(x) 


dx 
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and (39) follows since 


, . Varfc(x-X)) 

Var ({K^ * Pn){x)) = -^^ < — 


[k^{x-X))' 


{{K‘^)w * s){x) 


n 


nw 


■ (42) 


Now observe that if / is supported on [a, a + 2L\ 
(( k 2 )^* s )( x ) = 
so that by (39) 


l>a+2L I / j. _ ^ 

—K 
w 


( -^ s{z) dz < sup K ( -) {Kui * 'S)(x), 

\ W J a<z<a+2L \ W 


E., 


a / * Pn — a / Kjt, * S 


1 f f ^ ~ , 

< — / sup K - I dx. 

nw J a<z<a+2L \ W 


Since 


sup K 


X — z 


J a<z<a-\-2L \ 

(40) follows. 


] dx = i 

1 sup K 

/ J 

—L<y<L ^ 


w 


-L<y<L \ W 


V -y 


dv, 


If K is nonincreasing on (M 2 , +oo) and x > M 2 W + L, then sup_ 2 ,<j^<L K [w ^(x — y)) = 
K (w~^(x — L)) and 


.up K{^]dx=r iff —)rfx = u,rA-to)dp. 

Jm2w+l \ w j Jm2 


JM 2 W+L -L<y<L \ W 


Similarily, 

and finally 
sup K 


/ Miw—L / ,j. y\ 

sup K (- \dx = w K{y) dy 

-00 —L<y<L \ W J J—00 


x-y 


/ Ml POO 

K{x)dx + w / K{x)dx, 
-00 J M2 

which is (41). The unimodal case immediately follows from (40) and (41) with Mi = M 2 . □ 
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A Supplementary material 

We provide here additional simulations about the TVF based on the test statistic Tt^uO^) 
designed by Baraud in [3] and given by (18). As in Section 4, we study the influence of 
V and we compare the TVF based on this test with classical VF procedures. The results 
are summarized in Table 3 and Figure 3 which are the analogues of Table 2 and Figure 2 
respectively. 


family 

V 

Si 

S2 

S3 

S4 

^5 

sy 

Sl2 

Sl3 

S22 

S23 

S24 


2 

2,89 

9,97 

9,07 

13,2 

10,5 

11 

17,5 

14,7 

10,3 

19,9 

26,9 


5 

4,33 

9,68 

8,61 

12,4 

9,87 

10,4 

17,1 

13,4 

9,37 

17,8 

24,7 

-Tr 

10 

6,13 

9,65 

8,56 

12,1 

9,65 

10,4 

17 

13,7 

9,36 

17,5 

24,3 


20 

9,28 

9,47 

8,4 

12 

9,36 

10,3 

16,9 

14,2 

9,17 

17,4 

24,6 


BR 

2,20 

9,94 

9,27 

12,98 

10,53 

11,14 

17,85 

14,63 

10,37 

17,98 

25,15 


2 

15,6 

29,4 

5,69 

5,07 

3,55 

4,24 

27,2 

20 

3,97 

10,3 

18 


5 

13,2 

25,7 

5,1 

4,94 

3,58 

3,97 

23 

18,1 

3,85 

9,18 

16,2 

T-k 

10 

12,9 

24,8 

5 

5,02 

3,86 

4,01 

22,2 

17,7 

3,87 

9,04 

15,8 


20 

12,7 

24,4 

4,98 

5,28 

4,54 

4,1 

21,6 

17,6 

3,98 

8,98 

15,8 


UCV 

15,86 

22,20 

5,57 

6,16 

3,74 

4,10 

18,80 

17,16 

3,88 

9,52 

15,91 


2 

2,87 

10 

7,47 

5,88 

5,04 

5,6 

18,9 

14,7 

6,38 

11,6 

19,1 

T'kr 

5 

3,68 

9,77 

6,81 

5,48 

4,64 

5,19 

17,7 

13,3 

5,01 

9,3 

16,4 

10 

3,58 

9,84 

6,71 

5,53 

4,99 

5,26 

17,6 

13,7 

5,11 

9,04 

15,9 


20 

3,79 

9,84 

6,45 

5,65 

5,31 

5,83 

17,6 

14,6 

5,22 

9,01 

15,7 


Table 3: 1000 times Hellinger risks for the TVF procedure based on Baraud’s test. 


Influence of the test on the TVF 

We compare here the performances of the best TVF procedure (among the five values of 0 
described above) derived from Birge’s test (9) against the one deduced from Baraud’s test 
(18) (denoted We show the conclusion of our study for the families Tr, R'k and 

n = 500 and V = 2,5,10 and 20. The results are very similar for other values of n. For the 
sake of clarity and to emphasize the similarity of both procedures in terms of Hellinger risk, 
we present for each family, for each V, the supremum and the infimum over C of the ratio 

If inf^g/; T(s) > 1 the TVF using Baraud’s test behaves in a better way than the one using 
Birge’s test for all densities in £ while if sup^g^ T(s) < 1 the opposite holds. The closer the 
two values, the more similar the quality of both procedures. 
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Figure 3: From left to right, the boxplots of Ws (s, stvFj^^) using families J'kj-Fr and J'kr (up 
for s = 's^ , down for s = 's^ ). Each subfigure shows the boxplot for V = 2,5,10 and 20. The 

uilsvf rriKi^vF' “ 

horizontal red dotted line provides the reference value 0. 


family 

T(s) 

V = 2 

V = 5 

V = 10 

O 

CN 

II 

-Tr 

sup^ 

infs 

103,68 

98,16 

102,59 

100,07 

101,72 

99,59 

102,27 

99,13 

-Fk 

SUPs 

infs 

102,78 

99,58 

100,80 

98,72 

100,92 

97,45 

105,10 

96,13 

Fkr 

SUPs 

infs 

116,71 

96,70 

115,80 

98,84 

116,79 

99,08 

116,73 

99,30 


Table 4: Supremum and infimum of 100 times the ratio, see the text. 


We see from this table that Baraud’s and Birge’s test are very similar to process the TVF 
procedure for families and Tk- There is indeed no noticeable difference for these families, 
the largest gain (for a density in C) being of 5% only. The procedure based on Baraud’s test 
becomes much better for the family Tkr- We observe indeed that a potential gain of 15% 
appears (since the sup^ is close to 115%) while the loss is negligible (since the inf^ is close 
to 99%). Moreover, the ratios are quite similar when V increases. Finally, let us recall that 
the TVF procedure based on (9) is less time-consuming since it requires to compute only one 
integral instead of two for (18). 
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